Intro and Problem

Loading the Data

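library(readxl)   # read_excel()
library(caret)    # createDataPartition(), train(), trainControl()
library(ggplot2)  # ggplot() and related plotting functions
library(plotly)   # ggplotly()

Concrete_Data <- read_excel("Concrete_Data.xls")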

colnames(Concrete_Data) <- c("Cement", "Blast Furnace Slag", "Fly Ash",
                             "Water", "Superplasticizer", "Coarse Aggregate",
                             "Fine Aggregate", "Age", "Concrete Compressive Strength")
sum(is.na(Concrete_Data))
## [1] 0

We loaded the data into R from the Excel file it was originally stored in. The original column names were long, so we replaced them with shorter ones that are easier to work with. We then checked for missing values and found none. This makes sense, as the data was taken from the UCI Machine Learning Repository and had already been cleaned.
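The single total from sum(is.na(...)) does not say which column a missing value sits in. As a minimal sketch on a made-up data frame (the values below are illustrative and are not taken from the concrete dataset), colSums() gives one count per column:

```r
# Toy data frame standing in for Concrete_Data; the values are made up.
toy <- data.frame(
  Cement = c(540, 332.5, NA, 198.6),
  Water  = c(162, 228, 228, NA),
  Age    = c(28, 270, 365, 360)
)

total_missing  <- sum(is.na(toy))      # one overall count, as in the report
missing_by_col <- colSums(is.na(toy))  # one count per column

print(total_missing)
print(missing_by_col)
```

On a dataset with no missing values, both results are all zeros.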

Building the Predictive Model

set.seed(123)

trainingSetIndex <- createDataPartition(Concrete_Data$`Concrete Compressive Strength`, p = 0.75, list = FALSE)

trainData <- Concrete_Data[trainingSetIndex, ]
testData <- Concrete_Data[-trainingSetIndex, ]

Before we could start building the model, we had to randomly partition the data into two separate data sets. The first partition contains 75% of the rows and is used to train the model. The second partition contains the remaining 25% and is used to test the accuracy of our predictive model.
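For readers without caret, the same idea can be sketched in base R. This is a simplified stand-in on simulated data, not the method used in the report: createDataPartition() samples within groups of the outcome so both partitions have a similar distribution of strength, whereas sample() below draws rows purely at random. The names toy, train_toy and test_toy are hypothetical.

```r
set.seed(123)

# Simulated stand-in data; not the concrete dataset.
toy <- data.frame(
  x        = runif(200),
  strength = rnorm(200, mean = 35, sd = 15)
)

# Draw 75% of the row indices at random, without replacement.
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.75 * nrow(toy)))

train_toy <- toy[train_idx, ]   # rows used to fit the model
test_toy  <- toy[-train_idx, ]  # held-out rows used to evaluate it

print(c(train = nrow(train_toy), test = nrow(test_toy)))
```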

mod3 <- train(`Concrete Compressive Strength` ~ ., data = trainData, method = "lm", preProcess = c("scale", "center"), trControl = trainControl("none"))

mod3_training <- predict(mod3, trainData)
mod3_testing <- predict(mod3, testData)

We wanted to see how the model would perform before doing any transformations or further analysis. So, we used the training data to create a linear model regressing concrete compressive strength against all the independent variables. The data was preprocessed by standardizing the values (centering each predictor on its mean and scaling it to unit standard deviation). We then applied the model to make predictions based on the training data and the testing data.
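To make the standardization step concrete, here is a minimal base-R sketch on simulated data. It assumes, as is our understanding of caret's preProcess option, that the centering and scaling statistics are estimated on the training data and then reused for new data. All variable names (x1, x2, mu, sigma) are hypothetical.

```r
set.seed(1)

# Simulated stand-in data; not the concrete dataset.
train <- data.frame(x1 = rnorm(100, 280, 100), x2 = rnorm(100, 180, 20))
train$y <- 10 + 0.08 * train$x1 - 0.1 * train$x2 + rnorm(100, sd = 5)
test <- data.frame(x1 = rnorm(30, 280, 100), x2 = rnorm(30, 180, 20))

predictors <- c("x1", "x2")

# Means and standard deviations come from the training data only.
mu    <- colMeans(train[predictors])
sigma <- apply(train[predictors], 2, sd)

# Apply the same training statistics to both partitions.
train_z   <- as.data.frame(scale(train[predictors], center = mu, scale = sigma))
train_z$y <- train$y
test_z    <- as.data.frame(scale(test[predictors], center = mu, scale = sigma))

fit       <- lm(y ~ ., data = train_z)
pred_test <- predict(fit, newdata = test_z)

# With centered predictors, the intercept is the mean training response.
print(c(intercept = unname(coef(fit)[1]), mean_y = mean(train$y)))
```

One consequence of centering is visible in the last line: the fitted intercept equals the mean of the training response, so it does not change when predictors are added or transformed.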

Plotting the Training Model

trainingDF_mod3 <- data.frame(trainData$`Concrete Compressive Strength`, mod3_training)

a = ggplot(data = trainingDF_mod3, aes(x = trainData..Concrete.Compressive.Strength., 
  y = mod3_training)) + 
  geom_point(alpha = 5/8) + geom_smooth(method = "lm", se = FALSE, color = "red") + 
  theme_bw() + ggtitle("Concrete Compressive Strength (in MPa)") +
  xlab("True Training Values") + ylab("Training Values Predicted by the Model") + 
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 16))

ggplotly(a)

Shown above is the plot of the actual concrete compressive strength values in the training dataset (x-axis) versus the concrete compressive strength values predicted by our model (y-axis). If our model had predicted every value perfectly, all the points would fall on the line y = x. That is clearly not the case. As we move farther right on the x-axis, the points spread out quite a bit.
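The x aesthetic in the plotting code, trainData..Concrete.Compressive.Strength., is the column name that data.frame() generates when it is handed an unnamed expression. The sketch below reproduces that on three made-up values and shows a hypothetical alternative with explicit names (actual and predicted are not names used in the report):

```r
# Made-up values; only the column naming behaviour matters here.
trainData <- data.frame(
  `Concrete Compressive Strength` = c(44.3, 39.3, 8.1),
  check.names = FALSE
)
mod3_training <- c(61.6, 31.9, 20.9)

# Unnamed arguments: data.frame() builds syntactic names from the expressions.
auto_named <- data.frame(trainData$`Concrete Compressive Strength`, mod3_training)
print(names(auto_named))

# Named arguments: shorter column names that are easier to use in aes().
explicit <- data.frame(
  actual    = trainData$`Concrete Compressive Strength`,
  predicted = mod3_training
)
print(names(explicit))
```

With explicit names, the plot mapping could be written as aes(x = actual, y = predicted).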

Plotting the Testing Model

testingDF_mod3 <- data.frame(testData$`Concrete Compressive Strength`, mod3_testing)

b = ggplot(data = testingDF_mod3, aes(x = testData..Concrete.Compressive.Strength., 
  y = mod3_testing)) + 
  geom_point(alpha = 5/8) + geom_smooth(method = "lm", se = FALSE, color = "red") + 
  theme_bw() + ggtitle("Concrete Compressive Strength (in MPa)") + 
  xlab("True Testing Values") + ylab("Testing Values Predicted by the Model") + 
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 16))

ggplotly(b)

Shown above is the plot of the actual concrete compressive strength values in the testing dataset (x-axis) versus the concrete compressive strength values predicted by our model (y-axis). We see a pattern here that is similar to the previous plot. The variation in the data appears to increase as we move farther right on the x-axis.
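A visual comparison of the training and testing plots can be backed with a number such as the root mean squared error (RMSE) on each partition. The report does not compute this; the following is a minimal sketch on simulated data, and rmse() is a hypothetical helper defined here.

```r
# Root mean squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Simulated stand-in data; not the concrete dataset.
set.seed(42)
toy   <- data.frame(x = runif(200, 0, 10))
toy$y <- 3 + 2 * toy$x + rnorm(200, sd = 2)

train_idx <- sample(seq_len(nrow(toy)), size = 150)
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

fit <- lm(y ~ x, data = train_toy)

train_rmse <- rmse(train_toy$y, predict(fit, train_toy))
test_rmse  <- rmse(test_toy$y,  predict(fit, test_toy))

print(c(train = train_rmse, test = test_rmse))
```

A testing RMSE far above the training RMSE would suggest overfitting; similar values suggest the model generalizes about as well as it fits.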

Basic Model Summary

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.544  -6.502   0.606   6.626  34.618 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 35.8901     0.3764  95.362  < 2e-16 ***
## Cement                      11.9899     1.0381  11.550  < 2e-16 ***
## `\\`Blast Furnace Slag\\``   8.2480     1.0142   8.133 1.69e-15 ***
## `\\`Fly Ash\\``              5.0552     0.9465   5.341 1.22e-07 ***
## Water                       -3.5324     0.9781  -3.612 0.000324 ***
## Superplasticizer             1.9282     0.6500   2.967 0.003105 ** 
## `\\`Coarse Aggregate\\``     1.0195     0.8509   1.198 0.231227    
## `\\`Fine Aggregate\\``       1.0628     0.9799   1.085 0.278437    
## Age                          7.3198     0.4012  18.244  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.47 on 765 degrees of freedom
## Multiple R-squared:  0.6181, Adjusted R-squared:  0.6141 
## F-statistic: 154.8 on 8 and 765 DF,  p-value: < 2.2e-16

Looking at the summary, we see the model has an intercept of 35.8901. Because the predictors were standardized, the coefficient magnitudes can be compared directly: cement and blast furnace slag have the two largest coefficients, whereas coarse aggregate and fine aggregate have the two smallest and are not statistically significant. The model has a mediocre Adjusted R-squared value of 0.6141, which implies roughly 39% of the variation in the training data cannot be explained by the model.

Plotting the Residuals

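The residuals of the model appear to have a normal distribution and are symmetric about 0, which is consistent with the normal-error assumption of linear regression. On its own, this does not show that the model fits well; the Adjusted R-squared above is the better guide for that.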
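The code behind the residual plot is not shown in this report. As a minimal sketch on simulated data, these are common base-R diagnostics for the same question (the report's own plot may have been produced differently):

```r
# Simulated stand-in data; not the concrete dataset.
set.seed(7)
toy   <- data.frame(x = runif(300, 0, 10))
toy$y <- 5 + 1.5 * toy$x + rnorm(300, sd = 3)

fit <- lm(y ~ x, data = toy)
res <- resid(fit)

# Histogram: roughly bell-shaped and centered on 0 if the errors are normal.
hist(res, breaks = 30, main = "Residuals", xlab = "Residual")

# Normal Q-Q plot: points close to the reference line support normality.
qqnorm(res)
qqline(res, col = "red")

# Residuals vs fitted values: a funnel shape indicates non-constant variance.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

The third plot is the one that would reveal the spreading pattern seen in the predicted-versus-actual plots.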

Improving the Model

##                                    Cement Blast Furnace Slag      Fly Ash
## Cement                         1.00000000        -0.27519344 -0.397475440
## Blast Furnace Slag            -0.27519344         1.00000000 -0.323569468
## Fly Ash                       -0.39747544        -0.32356947  1.000000000
## Water                         -0.08154361         0.10728594 -0.257043997
## Superplasticizer               0.09277137         0.04337574  0.377339559
## Coarse Aggregate              -0.10935604        -0.28399823 -0.009976788
## Fine Aggregate                -0.22272017        -0.28159326  0.079076351
## Age                            0.08194726        -0.04424580 -0.154370165
## Concrete Compressive Strength  0.49783272         0.13482445 -0.105753348
##                                     Water Superplasticizer
## Cement                        -0.08154361       0.09277137
## Blast Furnace Slag             0.10728594       0.04337574
## Fly Ash                       -0.25704400       0.37733956
## Water                          1.00000000      -0.65746444
## Superplasticizer              -0.65746444       1.00000000
## Coarse Aggregate              -0.18231167      -0.26630276
## Fine Aggregate                -0.45063498       0.22250149
## Age                            0.27760443      -0.19271652
## Concrete Compressive Strength -0.28961348       0.36610230
##                               Coarse Aggregate Fine Aggregate          Age
## Cement                            -0.109356039    -0.22272017  0.081947264
## Blast Furnace Slag                -0.283998230    -0.28159326 -0.044245801
## Fly Ash                           -0.009976788     0.07907635 -0.154370165
## Water                             -0.182311668    -0.45063498  0.277604429
## Superplasticizer                  -0.266302755     0.22250149 -0.192716518
## Coarse Aggregate                   1.000000000    -0.17850575 -0.003015507
## Fine Aggregate                    -0.178505755     1.00000000 -0.156094049
## Age                               -0.003015507    -0.15609405  1.000000000
## Concrete Compressive Strength     -0.164927821    -0.16724896  0.328876976
##                               Concrete Compressive Strength
## Cement                                            0.4978327
## Blast Furnace Slag                                0.1348244
## Fly Ash                                          -0.1057533
## Water                                            -0.2896135
## Superplasticizer                                  0.3661023
## Coarse Aggregate                                 -0.1649278
## Fine Aggregate                                   -0.1672490
## Age                                               0.3288770
## Concrete Compressive Strength                     1.0000000

In an attempt to improve our predictions, we decided to look at the correlations between every pair of variables. We wanted to know whether any of the independent variables had a strong correlation with concrete compressive strength, and whether any two independent variables were related strongly enough to suggest an interaction. Upon plotting age versus concrete compressive strength, we noticed there was a logarithmic relationship between the two variables. We also noticed a moderate negative correlation between water and superplasticizer (about -0.66) and between water and fine aggregate (about -0.45).
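The effect of a logarithmic relationship on a linear correlation can be illustrated with simulated data. In this sketch the ages and the strength formula are made up for illustration; only the comparison between the two correlations matters.

```r
set.seed(99)

# Hypothetical curing ages in days and a strength that grows with log(age).
age      <- sample(c(1, 3, 7, 14, 28, 56, 90, 180, 365), 300, replace = TRUE)
strength <- 10 + 8 * log(age) + rnorm(300, sd = 4)

cor_raw <- cor(age, strength)       # linear correlation with raw age
cor_log <- cor(log(age), strength)  # linear correlation with log(age)

print(round(c(raw = cor_raw, log = cor_log), 3))
```

When the underlying relationship is logarithmic, the correlation with log(age) is higher than with raw age, which is the motivation for transforming the predictor before fitting a linear model.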

Building the Improved Predictive Model

mod4 <- train(`Concrete Compressive Strength` ~ Cement + `Blast Furnace Slag` + sqrt(`Fly Ash`) + Water  + sqrt(Superplasticizer) + `Coarse Aggregate` + `Fine Aggregate` + log(Age) + Water*Superplasticizer + Water*`Fine Aggregate`, data = trainData, method = "lm", preProcess = c("scale", "center"), trControl = trainControl("none"))
  
mod4_training <- predict(mod4, trainData)
mod4_testing <- predict(mod4, testData)

Based on the discoveries made about the relationships between some of the variables, we attempted to improve our model by taking the natural log of age, taking the square root of fly ash and of superplasticizer, and adding two interaction terms (water with superplasticizer and water with fine aggregate). We again used the training data to create a linear model regressing concrete compressive strength against the independent variables and interaction terms. The data was preprocessed by standardizing the values. We then applied the model to make predictions based on the training data and the testing data.
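In an R formula, a * b is shorthand for a + b + a:b, so Water*Superplasticizer adds an untransformed Superplasticizer main effect alongside the interaction. This can be checked with terms() on a shortened version of the formula (strength is a placeholder response name, and no data is needed):

```r
# Shortened version of the improved model's formula.
f <- strength ~ Water + sqrt(Superplasticizer) + Water * Superplasticizer

# term.labels lists every term the formula expands to.
labels <- attr(terms(f), "term.labels")
print(labels)
```

This is why the improved model's summary has separate rows for sqrt(Superplasticizer), Superplasticizer, and Water:Superplasticizer. Writing Water:Superplasticizer instead of Water*Superplasticizer would add only the interaction term.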

Plotting the Improved Training Model

trainingDF_mod4 <- data.frame(trainData$`Concrete Compressive Strength`, mod4_training)

c = ggplot(data = trainingDF_mod4, aes(x = trainData..Concrete.Compressive.Strength., 
  y = mod4_training)) + 
  geom_point(alpha = 5/8) + geom_smooth(method = "lm", se = FALSE, color = "red") + 
  theme_bw() + ggtitle("Concrete Compressive Strength (in MPa)") + 
  xlab("True Training Values") + ylab("Training Values Predicted by the Model") + 
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 16))

ggplotly(c)

Shown above is the plot of the actual concrete compressive strength values in the training dataset (x-axis) versus the concrete compressive strength values predicted by our improved model (y-axis). As you can see, there is a strong correlation present in the data and, unlike our first model, the variation stays fairly constant.

Plotting the Improved Testing Model

testingDF_mod4 <- data.frame(testData$`Concrete Compressive Strength`, mod4_testing)

d = ggplot(data = testingDF_mod4, aes(x = testData..Concrete.Compressive.Strength., y = mod4_testing)) + 
  geom_point(alpha = 5/8) + geom_smooth(method = "lm", se = FALSE, color = "red") + 
  theme_bw() + ggtitle("Concrete Compressive Strength (in MPa)") + 
  xlab("True Testing Values") + ylab("Testing Values Predicted by the Model") + 
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 16))

ggplotly(d)

Shown above is the plot of the actual concrete compressive strength values in the testing dataset (x-axis) versus the concrete compressive strength values predicted by our improved model (y-axis). We see a pattern here that is similar to the previous plot. The variation appears to remain constant and there is a strong and obvious positive correlation in the data.

Improved Model Summary

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5786  -4.4243  -0.2063   4.1486  25.3419 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   35.8901     0.2423 148.122  < 2e-16 ***
## Cement                        12.8595     0.6120  21.012  < 2e-16 ***
## `\\`Blast Furnace Slag\\``     8.3504     0.6221  13.423  < 2e-16 ***
## `sqrt(\\`Fly Ash\\`)`          3.4522     0.6484   5.324 1.33e-07 ***
## Water                         -7.8561     2.2810  -3.444 0.000604 ***
## `sqrt(Superplasticizer)`      12.5998     1.4290   8.817  < 2e-16 ***
## `\\`Coarse Aggregate\\``       1.3060     0.5087   2.568 0.010433 *  
## `\\`Fine Aggregate\\``        -4.1261     2.3197  -1.779 0.075682 .  
## `log(Age)`                    10.3688     0.2526  41.040  < 2e-16 ***
## Superplasticizer               4.3590     2.8637   1.522 0.128384    
## `Water:Superplasticizer`     -13.5814     3.3277  -4.081 4.95e-05 ***
## `Water:\\`Fine Aggregate\\``   6.4522     2.5733   2.507 0.012369 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.741 on 762 degrees of freedom
## Multiple R-squared:  0.8423, Adjusted R-squared:   0.84 
## F-statistic:   370 on 11 and 762 DF,  p-value: < 2.2e-16

Looking at the summary, we see the model still has an intercept of 35.8901. However, the largest coefficients now belong to the interaction term between water and superplasticizer (-13.58), cement (12.86), and the square root of superplasticizer (12.60). The two smallest coefficients are now coarse aggregate and the square root of fly ash. Because several terms in this model share the same underlying variables, these magnitudes are only a rough guide to influence. This model has a decent (and much improved) Adjusted R-squared value of 0.84, which implies roughly 16% of the variation in the training data cannot be explained by the model.
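The Adjusted R-squared values in both summaries can be reproduced from the reported Multiple R-squared, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The number of training rows is inferred from the basic model's summary (765 residual degrees of freedom plus 9 estimated coefficients); adj_r2() is a hypothetical helper.

```r
# Adjusted R-squared from R-squared, sample size n and predictor count p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Training rows: 765 residual df + 9 coefficients (8 predictors + intercept).
n_train <- 765 + 9

basic    <- adj_r2(0.6181, n_train, p = 8)   # basic model: 8 predictors
improved <- adj_r2(0.8423, n_train, p = 11)  # improved model: 11 terms

print(round(c(basic = basic, improved = improved), 4))
```

The penalty for the three extra terms in the improved model is small compared with the gain in R-squared, so the improvement is not simply a result of adding predictors.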

The residuals of the improved model appear to have a normal distribution and are symmetric about 0, which is consistent with the normal-error assumption of linear regression.

Specific Comparisons

##    Basic Model Prediction Improved Model Prediction
## 5                61.56700                 47.620266
## 10               31.93544                 35.517863
## 23               20.94199                  6.852754
## 89               50.93369                 40.077095
##    Actual Concrete Compressive Strength
## 5                             44.296075
## 10                            39.289790
## 23                             8.063422
## 89                            35.301171

For comparison, we have included the concrete compressive strength predicted by each of our models alongside the actual value for a few rows of the raw dataset. For all four rows shown, the improved model's prediction is closer to the actual value. However, this is not true for every data point in the dataset.
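The size of the improvement for these rows can be made explicit by computing absolute errors from the values in the table above. The data frame and column names here are hypothetical; the numbers are copied from the table.

```r
# Predictions and actual values copied from the comparison table.
comparison <- data.frame(
  row      = c(5, 10, 23, 89),
  basic    = c(61.56700, 31.93544, 20.94199, 50.93369),
  improved = c(47.620266, 35.517863, 6.852754, 40.077095),
  actual   = c(44.296075, 39.289790, 8.063422, 35.301171)
)

# Absolute prediction error of each model, row by row.
comparison$basic_abs_err    <- abs(comparison$basic - comparison$actual)
comparison$improved_abs_err <- abs(comparison$improved - comparison$actual)

print(comparison[, c("row", "basic_abs_err", "improved_abs_err")])
```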

Conclusion

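Using linear regression with transformed predictors and interaction terms, we were able to estimate the Concrete Compressive Strength based on the age and quantity of certain ingredients (input variables). Compared with the basic model, the improved model raised the Adjusted R-squared from 0.6141 to 0.84 and lowered the residual standard error from 10.47 to 6.741.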

Dataset Citation

I-Cheng Yeh, “Modeling of strength of high performance concrete using artificial neural networks,” Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).